<a href="https://colab.research.google.com/github/nuha18/AI-ML-Breakthrough/blob/main/AI_ML_Day5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Reinforcement Learning with the NIM Game
Let's teach our AI how to lose a simple game using Q-learning.

## 🎮 The NIM Game Rules
- Start with 25 sticks.
- Each player takes 1, 2, 3, 4, or 5 sticks on their turn.
- The player who takes the **last stick loses**.

We'll train an AI that gets worse at the game over time: it learns which moves have the lowest expected value and picks those on purpose!

In [24]:
MAX_STICKS = 25
ACTIONS = [1, 2, 3, 4, 5]
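
The rules above can be sketched as two small helpers. The names `legal_moves` and `takes_last_stick` are hypothetical and are not used elsewhere in this notebook; they only illustrate which moves are allowed and when a move loses the game.

```python
# Illustrative sketch of the rules; helper names are assumptions, not notebook code.
MAX_STICKS = 25
ACTIONS = [1, 2, 3, 4, 5]

def legal_moves(sticks):
    """Return the moves a player may make with `sticks` remaining."""
    return [a for a in ACTIONS if a <= sticks]

def takes_last_stick(sticks, move):
    """True if `move` removes the final stick, which loses the game."""
    return sticks - move == 0

print(legal_moves(MAX_STICKS))   # every move is legal at the start
print(legal_moves(3))            # only 1, 2, or 3 sticks can be taken
print(takes_last_stick(3, 3))    # taking all 3 remaining sticks loses
```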

## 🧠 Step 1: Create a Q-table
We’ll use a dictionary to store the AI’s knowledge — the expected value (Q) of taking each action in every possible state.

In [25]:
Q = {}
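
To make the layout concrete, here is a minimal sketch of what one row of the table looks like once a state has been seen. The helper `ensure_state` is a hypothetical name; the notebook performs the same initialization inline in later cells.

```python
# Sketch of the Q-table layout: state -> {action: expected value}.
# `ensure_state` is an illustrative helper, not part of the notebook.
ACTIONS = [1, 2, 3, 4, 5]
Q = {}

def ensure_state(q_table, state):
    """Create a zero-initialized row for `state` if it is missing."""
    if state not in q_table:
        q_table[state] = {a: 0.0 for a in ACTIONS}
    return q_table[state]

row = ensure_state(Q, 25)   # first visit creates the row
row[3] = -0.5               # pretend taking 3 sticks at 25 looked bad
print(Q[25][3])             # look up Q(state=25, action=3)
print(sorted(Q[25]))        # one entry per action
```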

## 🎲 Step 2: Action Choice
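Let’s write a function that chooses an action. We’ll use an **epsilon-greedy** strategy: with probability epsilon the AI picks a random legal move, and otherwise it picks the legal move with the lowest Q-value, because this AI is learning to lose.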

In [26]:

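import random

def choose_action(state, epsilon):
    """Pick the AI's move with an epsilon-greedy rule that favors losing."""
    if state not in Q:
        Q[state] = {a: 0 for a in ACTIONS}
    # Only moves that do not take more sticks than remain are legal.
    valid_actions = [a for a in ACTIONS if a <= state]
    if not valid_actions:
        return random.choice(ACTIONS)
    if random.random() < epsilon:
        # Explore: pick any legal move at random.
        return random.choice(valid_actions)
    # Exploit: pick the legal move with the LOWEST Q-value.
    return min(valid_actions, key=lambda a: Q[state][a])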
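
As a rough illustration of the two branches, the sketch below runs the same rule on a single hand-written row. The function `pick_move` and its `rng` parameter are assumptions made for this example, and the Q-values are rounded from the 4-stick row of the table printed later in this notebook.

```python
import random

def pick_move(q_row, legal, epsilon, rng=random):
    """Epsilon-greedy choice that exploits the LOWEST Q-value (illustrative)."""
    if rng.random() < epsilon:
        return rng.choice(legal)                 # explore
    return min(legal, key=lambda a: q_row[a])    # exploit the worst move

# Rounded values for the state with 4 sticks left; only moves 1-4 are legal.
q_row = {1: -0.47, 2: -0.30, 3: 0.10, 4: -1.00, 5: 0.0}
legal = [1, 2, 3, 4]

# epsilon = 0 never explores, so the lowest-valued move is always chosen.
greedy = pick_move(q_row, legal, epsilon=0.0)
print(greedy)

# epsilon = 1 always explores, so any legal move can come back.
rng = random.Random(0)
explored = {pick_move(q_row, legal, epsilon=1.0, rng=rng) for _ in range(200)}
print(sorted(explored))
```

With `epsilon=0` the sketch returns 4, which takes every remaining stick and loses immediately.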


## 💡 Step 3: Q-Value Update Rule
We’ll update the Q-values using this formula:
```
Q(s,a) = Q(s,a) + alpha * (reward + gamma * min(Q(s')) - Q(s,a))
```
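
A short worked example may help here. The sketch below applies the formula three times to a move that always takes the last stick (reward -1), assuming the terminal state contributes `min(Q(s')) = 0`. The function name `q_update` is hypothetical; the defaults `alpha=0.1` and `gamma=0.9` match the ones used in this notebook.

```python
def q_update(q_sa, reward, min_q_next, alpha=0.1, gamma=0.9):
    """Return the new Q(s,a) after one application of the update rule."""
    return q_sa + alpha * (reward + gamma * min_q_next - q_sa)

q = 0.0
history = []
for step in range(1, 4):
    # Losing move: reward -1, and the terminal state's values are all 0.
    q = q_update(q, reward=-1, min_q_next=0.0)
    history.append(round(q, 3))
    print(step, round(q, 3))
```

Each step moves the value 10% of the remaining distance toward -1 (-0.1, then -0.19, then -0.271), which is why losing moves in the trained table end up very close to -1.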

In [27]:

def update_q(state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # Make sure both states have a row in the Q-table.
    if state not in Q:
        Q[state] = {a: 0 for a in ACTIONS}
    if next_state not in Q:
        Q[next_state] = {a: 0 for a in ACTIONS}
    # This AI learns to lose, so it bootstraps from the LOWEST next-state value.
    min_q_next = min(Q[next_state].values())
    Q[state][action] += alpha * (reward + gamma * min_q_next - Q[state][action])


## 🔁 Step 4: Training Loop
Now we’ll play lots of games against a random opponent. The AI learns from experience, but because it always prefers the lowest-valued move, it gets worse at the game over time.

In [28]:
def train(episodes=10000, epsilon=0.3, alpha=0.1, gamma=0.9):
    for _ in range(episodes):
        state = MAX_STICKS
        last_state, last_action = None, None

        while state > 0:
            action = choose_action(state, epsilon)
            next_state = state - action

            # The AI's previous move led to this turn: update it with reward 0.
            if last_state is not None:
                update_q(last_state, last_action, 0, state, alpha, gamma)

            last_state = state
            last_action = action

            # The AI took the last stick and lost: reward -1.
            if next_state == 0:
                update_q(state, action, -1, next_state, alpha, gamma)
                break

            # The opponent plays a random legal move.
            # next_state >= 1 here, so at least one legal move exists.
            valid_opponent_actions = [a for a in ACTIONS if a <= next_state]
            opponent_action = random.choice(valid_opponent_actions)
            state = next_state - opponent_action

            # The opponent took the last stick, so the AI won: reward +1.
            # This update bootstraps from next_state, the position the opponent faced.
            if state <= 0:
                update_q(last_state, last_action, 1, next_state, alpha, gamma)
                break


## 🚀 Train the AI!

In [29]:
train()

In [30]:
print(Q)

{25: {1: -0.665813948471106, 2: -0.6845847906072849, 3: -0.6954657023869836, 4: -0.7000413677366316, 5: -0.7249038272887491}, 16: {1: -0.767875489279974, 2: -0.7875831198132539, 3: -0.7986009369148037, 4: -0.8099999999999987, 5: -0.7983017546734399}, 10: {1: -0.8400889060927386, 2: -0.8716851301551365, 3: -0.8516037044522361, 4: -0.899999999999999, 5: -0.7940740500570043}, 7: {1: -0.899999999999999, 2: -0.6900160288760596, 3: -0.7124920502834486, 4: -0.6825443750481636, 5: -0.4135423822443195}, 4: {1: -0.4650297953376299, 2: -0.3021498932480863, 3: 0.09999999968997327, 4: -0.9999999999999996, 5: 0}, 2: {1: 0.10000000000001696, 2: -0.9999999999999996, 3: 0, 4: 0, 5: 0}, 23: {1: -0.42441909060238747, 2: -0.7155604597461888, 3: -0.19351491114590536, 4: -0.261625458100728, 5: -0.19755764405759313}, 19: {1: -0.7289445235262316, 2: -0.734814140944574, 3: -0.7489638878651317, 4: -0.7634370776963821, 5: -0.7939758354198566}, 14: {1: -0.7791648219527992, 2: -0.8099999940375344, 3: -0.7787280801
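
The raw dictionary is hard to read, so here is a sketch of one way to turn it into a move per state. The helper `greedy_losing_policy` is a hypothetical name, and `Q_sample` holds only two rows copied (rounded) from the output above; passing the full `Q` would work the same way.

```python
ACTIONS = [1, 2, 3, 4, 5]

# Two rows rounded from the printed Q-table above.
Q_sample = {
    4: {1: -0.465, 2: -0.302, 3: 0.100, 4: -1.000, 5: 0.0},
    2: {1: 0.100, 2: -1.000, 3: 0.0, 4: 0.0, 5: 0.0},
}

def greedy_losing_policy(q_table):
    """Map each state to the legal move with the lowest Q-value."""
    policy = {}
    for state in sorted(q_table):
        legal = [a for a in ACTIONS if a <= state]
        if legal:  # skip the terminal state, which has no legal moves
            policy[state] = min(legal, key=lambda a: q_table[state][a])
    return policy

for state, move in greedy_losing_policy(Q_sample).items():
    print(f"{state} sticks left -> take {move}")
```

In both sample rows the chosen move takes every remaining stick, which is exactly the move that loses the game.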

## 🧪 Let’s play against the AI!

In [31]:

def play():
    state = MAX_STICKS
    while state > 0:
        print(f"Sticks left: {state}")
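        # Reject anything that is not a whole number instead of crashing.
        try:
            move = int(input("Your move (1–5): "))
        except ValueError:
            print("Invalid move.")
            continue
        if move not in ACTIONS or move > state:
            print("Invalid move.")
            continue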
        state -= move
        if state <= 0:
            print("You took the last stick. You lose!")
            return
        if state in Q:
            ai_move = choose_action(state, epsilon=0)
        else:
            ai_move = random.choice([a for a in ACTIONS if a <= state])
        print(f"AI takes {ai_move} stick(s).")
        state -= ai_move
        if state <= 0:
            print("AI took the last stick. You win!")
            return


In [32]:
play()

Sticks left: 25
Your move (1–5): 1
AI takes 4 stick(s).
Sticks left: 20
Your move (1–5): 1
AI takes 5 stick(s).
Sticks left: 14
Your move (1–5): 1
AI takes 1 stick(s).
Sticks left: 12
Your move (1–5): 2
AI takes 4 stick(s).
Sticks left: 6
Your move (1–5): 5
AI takes 1 stick(s).
AI took the last stick. You win!


## 🎉 Summary
You just trained an agent to play a game using trial-and-error. That’s the magic of Reinforcement Learning!