# Training RL to Balance Cartpole

This notebook is part of the [AI for Beginners Curriculum](http://aka.ms/ai-beginners). It was inspired by the [official PyTorch tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html) and [this Cartpole PyTorch implementation](https://github.com/yc930401/Actor-Critic-pytorch).

In this example, we will use RL to train a model to balance a pole on a cart that can move left and right along a horizontal track. We will use the [OpenAI Gym](https://www.gymlibrary.ml/) environment to simulate the pole.

> **Note**: You can run this lesson's code locally (e.g. from Visual Studio Code), in which case the simulation opens in a new window. When running the code online, you may need to make small changes to the code, as described [here](https://towardsdatascience.com/rendering-openai-gym-envs-on-binder-and-google-colab-536f99391cc7).

We will start by making sure Gym is installed:


In [None]:
import sys
!{sys.executable} -m pip install gym

Now let's create the CartPole environment and see how to operate on it. An environment has the following properties:

* **Action space** is the set of possible actions that we can perform at each step of the simulation
* **Observation space** is the space of observations that we can make


In [None]:
import gym

env = gym.make("CartPole-v1")

print(f"Action space: {env.action_space}")
print(f"Observation space: {env.observation_space}")

Let's see how the simulation works. The following loop runs the simulation until `env.step` returns the termination flag `done` as `True`. We will choose actions randomly using `env.action_space.sample()`, which means the experiment will probably fail very fast (the CartPole environment terminates when the cart position or the pole angle goes outside certain limits).

> The simulation will open in a new window. You can run the code several times and see how it behaves.


In [None]:
env.reset()

done = False
total_reward = 0
while not done:
    env.render()
    obs, rew, done, info = env.step(env.action_space.sample())
    total_reward += rew
    print(f"{obs} -> {rew}")
print(f"Total reward: {total_reward}")

You can notice that each observation contains 4 numbers. They are:
- Position of the cart
- Velocity of the cart
- Angle of the pole
- Angular velocity of the pole
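
To make the four components concrete, here is a minimal sketch that names them and checks whether an observation is outside the episode limits. The limits (2.4 for the cart position, 12 degrees for the pole angle) are an assumption taken from the Gym CartPole documentation, not from this notebook, and the helpers `describe` and `out_of_bounds` are hypothetical names used only for illustration.

```python
import math

# Assumed limits, taken from the Gym CartPole documentation (not from this notebook).
X_LIMIT = 2.4                         # cart position limit, in track units
THETA_LIMIT = 12 * math.pi / 180      # pole angle limit, in radians (12 degrees)

def describe(obs):
    """Unpack a 4-element CartPole observation into named fields."""
    cart_pos, cart_vel, pole_angle, pole_ang_vel = obs
    return {
        "cart_pos": cart_pos,
        "cart_vel": cart_vel,
        "pole_angle": pole_angle,
        "pole_ang_vel": pole_ang_vel,
    }

def out_of_bounds(obs):
    """Return True if the observation violates the assumed episode limits."""
    fields = describe(obs)
    return abs(fields["cart_pos"]) > X_LIMIT or abs(fields["pole_angle"]) > THETA_LIMIT

balanced = [0.01, -0.02, 0.03, 0.04]   # hypothetical near-upright observation
fallen = [0.5, 1.0, 0.30, 2.0]         # hypothetical observation with a large pole angle
print(out_of_bounds(balanced), out_of_bounds(fallen))
```

Under these assumed limits, only the second observation counts as a failure, because its pole angle of 0.30 rad exceeds 12 degrees (about 0.21 rad).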

`rew` is the reward we receive at each step. In the CartPole environment you are rewarded 1 point for each simulation step, and the goal is to maximize the total reward, i.e. the time CartPole is able to balance without falling.

In reinforcement learning, our goal is to train a **policy** $\pi$ that, for each state $s$, tells us which action $a$ to take, so essentially $a = \pi(s)$.

If you want a probabilistic solution, you can think of the policy as returning a set of probabilities for each action, i.e. $\pi(a|s)$ would mean the probability that we should take action $a$ in state $s$.
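
The difference between the two views of a policy can be sketched in a few lines of plain Python. Both rules below are hypothetical hand-written policies, not something the notebook trains; the action encoding (0 = push left, 1 = push right) follows the Gym CartPole documentation and should be treated as an assumption here.

```python
import random

def deterministic_policy(state):
    """Hypothetical rule a = pi(s): push the cart toward the side the pole leans to."""
    pole_angle = state[2]
    return 1 if pole_angle > 0 else 0   # assumed encoding: 0 = push left, 1 = push right

def stochastic_policy(state):
    """Hypothetical pi(a|s): return one probability per action."""
    pole_angle = state[2]
    p_right = 0.8 if pole_angle > 0 else 0.2
    return [1.0 - p_right, p_right]

rng = random.Random(0)                  # seeded so the sketch is repeatable
state = [0.0, 0.0, 0.05, 0.0]           # hypothetical state: pole leaning slightly right
probs = stochastic_policy(state)
action = rng.choices([0, 1], weights=probs, k=1)[0]   # sample a ~ pi(.|s)
print(deterministic_policy(state), probs, action)
```

The deterministic rule always returns the same action for a given state, while the stochastic one returns a distribution from which an action is sampled.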

## Policy Gradient Method

In the simplest RL algorithm, called **Policy Gradient**, we will train a neural network to predict the next action.


In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch

num_inputs = 4
num_actions = 2

model = torch.nn.Sequential(
    torch.nn.Linear(num_inputs, 128, bias=False, dtype=torch.float32),
    torch.nn.ReLU(),
    torch.nn.Linear(128, num_actions, bias = False, dtype=torch.float32),
    torch.nn.Softmax(dim=1)
)
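
The final `Softmax` layer turns the two raw outputs of the network into action probabilities. As a rough illustration in plain Python (the input scores are made up, and this is not how PyTorch implements the layer internally):

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, -0.4]        # hypothetical raw outputs for the two actions
probs = softmax(logits)
print(probs)
```

The larger score gets the larger probability, and the two values always sum to 1, which is what lets us sample an action from them.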

We will train the network by running many experiments and updating the network after each run. Let's define a function that runs the experiment and returns the results (the so-called **trace**): all states, actions (and their recommended probabilities), and rewards:


In [None]:
def run_episode(max_steps_per_episode = 10000,render=False):    
    states, actions, probs, rewards = [],[],[],[]
    state = env.reset()
    for _ in range(max_steps_per_episode):
        if render:
            env.render()
        action_probs = model(torch.from_numpy(np.expand_dims(state,0)))[0]
        action = np.random.choice(num_actions, p=np.squeeze(action_probs.detach().numpy()))
        nstate, reward, done, info = env.step(action)
        if done:
            break
        states.append(state)
        actions.append(action)
        probs.append(action_probs.detach().numpy())
        rewards.append(reward)
        state = nstate
    return np.vstack(states), np.vstack(actions), np.vstack(probs), np.vstack(rewards)

You can run one episode with the untrained network and observe that the total reward (i.e. the length of the episode) is very low:


In [None]:
s, a, p, r = run_episode()
print(f"Total reward: {np.sum(r)}")

One of the tricky aspects of the policy gradient algorithm is the use of **discounted rewards**. The idea is that we compute the vector of total rewards at each step of the game, and in this process rewards that arrive further in the future are discounted by a coefficient $\gamma$. We also normalize the resulting vector, because we will use it as a weight that affects our training:


In [None]:
eps = 0.0001

def discounted_rewards(rewards,gamma=0.99,normalize=True):
    ret = []
    s = 0
    for r in rewards[::-1]:
        s = r + gamma * s
        ret.insert(0, s)
    if normalize:
        ret = (ret-np.mean(ret))/(np.std(ret)+eps)
    return ret
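
As a quick check of the recursion $G_t = r_t + \gamma G_{t+1}$, here is a small worked example in plain Python. It re-implements the same idea on ordinary floats (the helper names `discounted_returns` and `normalize` are hypothetical), and uses $\gamma = 0.5$ rather than 0.99 only so that the numbers are easy to verify by hand.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} by scanning the rewards backwards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return returns

def normalize(values, eps=1e-4):
    """Shift to zero mean and scale to (roughly) unit standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + eps) for v in values]

returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)              # [1.75, 1.5, 1.0]
print(normalize(returns))
```

Early steps get the largest return because more reward still lies ahead of them. After normalization they carry positive weight, while the last steps before the failure carry negative weight.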

Now let's do the actual training! We will run 300 episodes, and in each episode we will do the following:

1. Run the experiment and collect the trace.
2. Calculate the difference (`gradients`) between the actions taken and the predicted probabilities. The smaller the difference, the more sure we are that we have taken the right action.
3. Calculate the discounted rewards and multiply the gradients by them. This makes sure that steps with higher rewards have more effect on the final result than lower-rewarded ones.
4. Build the expected target actions for our neural network partly from the probabilities predicted during the run, and partly from the calculated gradients. We will use the `alpha` parameter to determine to which extent gradients and rewards are taken into account; this is called the *learning rate* of the reinforcement algorithm.
5. Finally, train the network on the states and expected actions, and repeat the process.


In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

def train_on_batch(x, y):
    x = torch.from_numpy(x)
    y = torch.from_numpy(y)
    optimizer.zero_grad()
    predictions = model(x)
    loss = -torch.mean(torch.log(predictions) * y)
    loss.backward()
    optimizer.step()
    return loss
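
The loss in `train_on_batch` is a weighted negative log-likelihood: each log-probability is multiplied by the corresponding target weight and the result is averaged over all entries. A plain-Python sketch of the same arithmetic, with made-up numbers and a hypothetical helper name `weighted_log_loss`:

```python
import math

def weighted_log_loss(predictions, targets):
    """Mean of -log(p) * y over every entry, mirroring -mean(log(predictions) * y)."""
    terms = [
        -math.log(p) * y
        for pred_row, target_row in zip(predictions, targets)
        for p, y in zip(pred_row, target_row)
    ]
    return sum(terms) / len(terms)

targets = [[1.0, 0.0]]                       # hypothetical target: all weight on action 0
confident = weighted_log_loss([[0.9, 0.1]], targets)
unsure = weighted_log_loss([[0.5, 0.5]], targets)
print(confident, unsure)                     # the confident, matching prediction scores lower
```

Minimizing this quantity pushes the predicted probabilities toward the targets.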

In [None]:
alpha = 1e-4

history = []
for epoch in range(300):
    states, actions, probs, rewards = run_episode()
    one_hot_actions = np.eye(2)[actions.T][0]
    gradients = one_hot_actions-probs
    dr = discounted_rewards(rewards)
    gradients *= dr
    target = alpha*np.vstack([gradients])+probs
    train_on_batch(states,target)
    history.append(np.sum(rewards))
    if epoch%100==0:
        print(f"{epoch} -> {np.sum(rewards)}")

plt.plot(history)
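
To see what steps 2 to 4 of the list above do numerically, here is a sketch for a single time step in plain Python. The helper `step_target` is hypothetical, the input numbers are made up, and `alpha` is deliberately exaggerated (0.1 instead of 1e-4) so that the shift is visible.

```python
def step_target(action, probs, discounted_reward, alpha):
    """Nudge predicted probabilities toward (or away from) the action taken."""
    one_hot = [1.0 if a == action else 0.0 for a in range(len(probs))]
    gradient = [o - p for o, p in zip(one_hot, probs)]        # step 2
    weighted = [g * discounted_reward for g in gradient]      # step 3
    return [p + alpha * w for p, w in zip(probs, weighted)]   # step 4

probs = [0.6, 0.4]                       # hypothetical prediction for one state
# alpha is exaggerated here (the notebook uses 1e-4) so the shift is visible
good = step_target(action=1, probs=probs, discounted_reward=+1.5, alpha=0.1)
bad = step_target(action=1, probs=probs, discounted_reward=-1.5, alpha=0.1)
print(good, bad)
```

When the normalized discounted reward is positive, the target moves toward the action that was taken. When it is negative, the target moves away from it.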

Now let's run the episode with rendering to see the result:


In [None]:
_ = run_episode(render=True)

We can see that the pole can now balance pretty well!

## Actor-Critic Model

The Actor-Critic model is a further development of policy gradients, in which we build a neural network to learn both the policy and the estimated rewards. The network will have two outputs (or you can view it as two separate networks):
* **Actor** recommends the action to take by giving us a probability distribution over actions for the current state, as in the policy gradient model.
* **Critic** estimates the reward that would result from those actions. It returns the total estimated future reward from the given state.

Let's define such a model:


In [None]:
from itertools import count
import torch.nn.functional as F

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
env = gym.make("CartPole-v1")

state_size = env.observation_space.shape[0]
action_size = env.action_space.n
lr = 0.0001

class Actor(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(Actor, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.linear1 = torch.nn.Linear(self.state_size, 128)
        self.linear2 = torch.nn.Linear(128, 256)
        self.linear3 = torch.nn.Linear(256, self.action_size)

    def forward(self, state):
        output = F.relu(self.linear1(state))
        output = F.relu(self.linear2(output))
        output = self.linear3(output)
        distribution = torch.distributions.Categorical(F.softmax(output, dim=-1))
        return distribution


class Critic(torch.nn.Module):
    def __init__(self, state_size, action_size):
        super(Critic, self).__init__()
        self.state_size = state_size
        self.action_size = action_size
        self.linear1 = torch.nn.Linear(self.state_size, 128)
        self.linear2 = torch.nn.Linear(128, 256)
        self.linear3 = torch.nn.Linear(256, 1)

    def forward(self, state):
        output = F.relu(self.linear1(state))
        output = F.relu(self.linear2(output))
        value = self.linear3(output)
        return value

We need to slightly modify our `discounted_rewards` and `run_episode` functions:


In [None]:
def discounted_rewards(next_value, rewards, masks, gamma=0.99):
    R = next_value
    returns = []
    for step in reversed(range(len(rewards))):
        R = rewards[step] + gamma * R * masks[step]
        returns.insert(0, R)
    return returns

def run_episode(actor, critic, n_iters):
    optimizerA = torch.optim.Adam(actor.parameters())
    optimizerC = torch.optim.Adam(critic.parameters())
    for iter in range(n_iters):
        # Reset once per episode and keep the returned initial observation
        state = env.reset()
        log_probs = []
        values = []
        rewards = []
        masks = []
        entropy = 0

        for i in count():
            env.render()
            state = torch.FloatTensor(state).to(device)
            dist, value = actor(state), critic(state)

            action = dist.sample()
            next_state, reward, done, _ = env.step(action.cpu().numpy())

            log_prob = dist.log_prob(action).unsqueeze(0)
            entropy += dist.entropy().mean()

            log_probs.append(log_prob)
            values.append(value)
            rewards.append(torch.tensor([reward], dtype=torch.float, device=device))
            masks.append(torch.tensor([1-done], dtype=torch.float, device=device))

            state = next_state

            if done:
                print('Iteration: {}, Score: {}'.format(iter, i))
                break


        next_state = torch.FloatTensor(next_state).to(device)
        next_value = critic(next_state)
        returns = discounted_rewards(next_value, rewards, masks)

        log_probs = torch.cat(log_probs)
        returns = torch.cat(returns).detach()
        values = torch.cat(values)

        advantage = returns - values

        actor_loss = -(log_probs * advantage.detach()).mean()
        critic_loss = advantage.pow(2).mean()

        optimizerA.zero_grad()
        optimizerC.zero_grad()
        actor_loss.backward()
        critic_loss.backward()
        optimizerA.step()
        optimizerC.step()
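
To illustrate how the masked returns and the advantage fit together, here is a worked example on plain floats. It repeats the recursion from `discounted_rewards` under a hypothetical name (`bootstrapped_returns`), with made-up critic values and $\gamma = 0.5$ chosen only so the arithmetic is easy to follow.

```python
def bootstrapped_returns(next_value, rewards, masks, gamma=0.99):
    """Same recursion as discounted_rewards above, on plain floats."""
    R = next_value
    returns = []
    for step in reversed(range(len(rewards))):
        R = rewards[step] + gamma * R * masks[step]
        returns.insert(0, R)
    return returns

rewards = [1.0, 1.0, 1.0]
masks = [1.0, 1.0, 0.0]          # mask 0 marks the terminal step
values = [2.0, 2.5, 0.5]         # hypothetical critic estimates V(s_t)

# The critic's guess for the state after termination (7.0) is ignored thanks to the mask.
returns = bootstrapped_returns(next_value=7.0, rewards=rewards, masks=masks, gamma=0.5)
advantages = [ret - v for ret, v in zip(returns, values)]
print(returns)      # [1.75, 1.5, 1.0]
print(advantages)   # [-0.25, -1.0, 0.5]
```

A positive advantage means the step turned out better than the critic expected, so the actor loss makes that action more likely. A negative advantage does the opposite, and the critic loss shrinks the squared advantage in both cases.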


Now we will run the main training loop. We train the networks manually, by computing the appropriate loss functions and updating the network parameters:


In [None]:

actor = Actor(state_size, action_size).to(device)
critic = Critic(state_size, action_size).to(device)
run_episode(actor, critic, n_iters=100)

In [None]:
env.close()

## Takeaway

We have seen two RL algorithms in this demo: simple policy gradient, and the more sophisticated actor-critic. You can see that these algorithms operate with abstract notions of state, action, and reward, so they can be applied to very different environments.

Reinforcement learning allows us to learn the best strategy for solving a problem just by looking at the final reward. The fact that we do not need labeled datasets allows us to repeat simulations many times to optimize our models. However, there are still many challenges in RL, which you may learn about if you decide to focus more on this interesting area of AI.



