# Blackjack 
- The **learning** library implements several reinforcement learning algorithms (e.g. sarsa, qlearning, MC, sarsa_lambda, and q_lambda).
- We will compare the SARSA and Q-learning algorithms by training an agent to play blackjack.
- The **objective** of the exercise is for you to understand and explain how both algorithms are used to play this game. You may use the following reference for the SARSA and Q-learning algorithms: https://tcnguyen.github.io/reinforcement_learning/sarsa_vs_q_learning.html
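
As a starting point for the comparison, the two tabular update rules can be sketched as follows. This is a minimal illustration of the textbook updates, not the code in `learning.model_free`; the function names `sarsa_update` and `q_learning_update` and the array layout `Q[state, action]` are assumptions made for this example.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0):
    """On-policy update: bootstrap from the action actually chosen in s_next."""
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=1.0):
    """Off-policy update: bootstrap from the greedy (max-value) action in s_next."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical toy table with 2 states and 2 actions, for illustration only.
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]
Q_sarsa = sarsa_update(Q.copy(), s=0, a=0, r=0.0, s_next=1, a_next=0)
Q_qlearn = q_learning_update(Q.copy(), s=0, a=0, r=0.0, s_next=1)
print(Q_sarsa[0, 0], Q_qlearn[0, 0])
```

The only difference is the bootstrap term: SARSA uses the value of the action the behaviour policy actually takes next, while Q-learning uses the maximum over next actions. If the Q-values of a terminal state are kept at zero, the same update also covers the final step of an episode.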

### Import dependencies
- All Python implementations are in the **learning** folder.
- Carefully read the SARSA and Q-learning implementations (also used in exercise-01) and comment on them.

In [None]:
import numpy as np
from itertools import product

# These functions are defined in the folder named learning.
from learning.model_free import Problem
from learning.model_free import sarsa
from learning.model_free import qlearning

In [None]:
# Write your comments in this cell
'''



'''

### Create the game class
- This class inherits from the class **Problem** and adds simple methods that define the components of the blackjack game.
- As comments, clearly state the components of the Markov Decision Process (MDP) used for the blackjack game.
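
To help with describing the MDP, the sketch below counts the states that the class constructor enumerates and derives the card-value distribution implied by `min(10, np.random.randint(1, 14))` in `get_card`. The names `TERMINAL`, `states`, and `card_probs` are introduced here only for illustration.

```python
from fractions import Fraction
from itertools import product

# State space as built in BlackJack.__init__: one terminal state plus
# (player sum 12..21) x (dealer showing card 1..10) x (usable ace 0/1).
TERMINAL = (-1, -1, -1)
states = [TERMINAL] + list(product(range(12, 22), range(1, 11), [0, 1]))

# Ranks 1..13 are drawn uniformly and ranks 10..13 all count as 10,
# so a card of value 10 is four times as likely as any other value.
card_probs = {}
for rank in range(1, 14):
    value = min(10, rank)
    card_probs[value] = card_probs.get(value, Fraction(0)) + Fraction(1, 13)

print(len(states))     # 201
print(card_probs[10])  # 4/13
```

Because cards are drawn with replacement from this fixed distribution, the transition probabilities depend only on the current state and action, which is what makes the game a Markov Decision Process.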

In [None]:
class BlackJack(Problem):

    def __init__(self):
        # Each state is (sum of player's cards, dealer's showing card, usable ace).
        # (-1, -1, -1), stored at index 0, is the terminal state (see is_final).
        self.states = [(-1, -1, -1)]
        self.states += [(i, j, k)
                        for (i, j, k) in product(range(12, 22), range(1, 11), [0, 1])]
        self.a = ['hit', 'stick']

        self.states_map = {s: i for i, s in enumerate(self.states)}

        Problem.__init__(self, len(self.states), len(self.a))

    def get_card(self):
        # Draw a rank uniformly from 1..13; ranks above 10 (face cards) count as 10
        return min(10, np.random.randint(1, 14))

    def sample_initial_state(self):
        my_card = np.random.randint(12, 22)
        dealer_showing = self.get_card()
        usable_ace = np.random.randint(0, 2)

        return self.states_map[(my_card, dealer_showing, usable_ace)]

    def hand_value(self, sum_cards, usable_ace):
        if usable_ace and sum_cards > 21:
            return sum_cards - 10
        return sum_cards

    def actions(self, s):
        (my_sum, _, usable_ace) = self.states[s]

        if self.hand_value(my_sum, usable_ace) >= 21:
            return [1]
        else:
            return [0, 1]

    def is_final(self, s):
        return s == 0

    # Returns (next_state, reward) for action a in state s; next_state 0 is the terminal state
    def state_reward(self, s, a):
        (my_sum, dealer_card, usable_ace) = self.states[s]
        next_s = self.states_map[(my_sum, dealer_card, usable_ace)]

        if a == 1:  # Stick
            if self.hand_value(my_sum, usable_ace) > 21:
                return 0, -1

            dealer_sum = dealer_card
            dealer_usable_ace = 0
            if dealer_card == 1:
                dealer_sum += 10
                dealer_usable_ace = 1

            while self.hand_value(dealer_sum, dealer_usable_ace) < self.hand_value(my_sum, usable_ace):
                card = self.get_card()
                dealer_sum += card
                if card == 1:
                    dealer_sum += 10

                if card == 1 or dealer_usable_ace:
                    if dealer_sum <= 21:
                        dealer_usable_ace = 1
                    else:
                        dealer_sum -= 10
                        dealer_usable_ace = 0

                if self.hand_value(dealer_sum, dealer_usable_ace) == self.hand_value(my_sum, usable_ace) == 17:
                    return 0, 0

            if self.hand_value(dealer_sum, dealer_usable_ace) > 21:
                return 0, 1

            if self.hand_value(dealer_sum, dealer_usable_ace) == self.hand_value(my_sum, usable_ace):
                return 0, 0

            if self.hand_value(dealer_sum, dealer_usable_ace) < self.hand_value(my_sum, usable_ace):
                return 0, 1

            # if dealer_sum > my_sum:
            return 0, -1
        else:  # Hit
            card = self.get_card()
            my_sum += card
            if card == 1:
                my_sum += 10

            if card == 1 or usable_ace:
                if my_sum <= 21:
                    usable_ace = 1
                else:
                    my_sum -= 10
                    usable_ace = 0

            if self.hand_value(my_sum, usable_ace) > 21:
                return 0, -1

            # Only nonterminal case
            next_s = self.states_map[(my_sum, dealer_card, usable_ace)]
            return next_s, 0

        raise Exception('Unexpected state/action pair')

    def print_policy(self, policy):
        print('Usable ace:')
        for i, state in enumerate(self.states):
            if state[2]:
                print('Hand value: {0}, Dealer Showing: {1}, Action: {2}'.format(
                    self.hand_value(state[0], 1), state[1], self.a[policy[i]]))

        print('No usable ace:')
        for i, state in enumerate(self.states):
            if not state[2]:
                print('Hand value: {0}, Dealer Showing: {1}, Action: {2}'.format(
                    self.hand_value(state[0], 0), state[1], self.a[policy[i]]))

    def print_values(self, values):
        for i in range(len(values)):
            print('State {0}. Value: {1}'.format(self.states[i], values[i]))

In [None]:
# Write your comments in this cell
'''


'''

### Main function with different tests
- Modify this code to test different scenarios and hyperparameters.
- Also, compare both algorithms by evaluating how the reward evolves over the trials/steps of the process.
- Comment on the differences.
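
Single blackjack rewards are noisy (each episode ends with -1, 0, or 1), so a smoothed curve is usually easier to compare. The sketch below is one possible helper; `moving_average` is a hypothetical name, and it assumes you have collected a list with one total reward per episode. As called in this notebook, `sarsa` and `qlearning` return only a policy and values, so recording those rewards would require modifying them or re-running episodes yourself.

```python
import numpy as np

def moving_average(rewards, window=100):
    """Average reward over a sliding window of consecutive episodes."""
    rewards = np.asarray(rewards, dtype=float)
    if window < 1 or window > len(rewards):
        raise ValueError("window must be between 1 and len(rewards)")
    kernel = np.ones(window) / window
    # mode='valid' keeps only windows fully inside the data,
    # so the result has len(rewards) - window + 1 entries.
    return np.convolve(rewards, kernel, mode='valid')

# Illustration with made-up rewards, not results from the algorithms.
example_rewards = [1, -1, 1, -1]
print(moving_average(example_rewards, window=2))
```

Plotting the smoothed curves of both algorithms on the same axes (for example with matplotlib) then gives a direct visual comparison of how quickly each one improves.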

In [None]:
# Write your comments in this cell
'''


'''

In [None]:
def main():
    problem = BlackJack()

    print("****************************************************")
    print("SARSA algorithm")
    print("****************************************************")
    pi, v = sarsa(problem, 10000, epsilon=0.1, alpha=0.1, gamma=1.0)

    problem.print_policy(pi)
    problem.print_values(v)

    
    print("****************************************************")
    print("Q learning algorithm")
    print("****************************************************")
    pi, v = qlearning(problem, 10000, epsilon=0.1, alpha=0.1, gamma=1.0)

    problem.print_policy(pi)
    problem.print_values(v)


In [None]:
# Executing this line should also plot the evolution of the reward
# for each algorithm. You may modify any part of the code, just be
# careful to comment on the changes you implement.

main()