# Simple Reinforcement Learning in TensorFlow

## The Multi-armed Bandit

For more information, see this [Medium post](https://medium.com/@awjuliani/super-simple-reinforcement-learning-tutorial-part-1-fd544fab149). 

In [35]:
import tensorflow as tf
import numpy as np

### The Bandits
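Here we define our bandits. For this example we are using a six-armed bandit: each entry in `bandits` is the threshold for one arm. The `pullBandit` function draws a random number from a standard normal distribution (mean 0, standard deviation 1) and returns a reward of 1 if the draw exceeds the arm's threshold, and -1 otherwise. The lower the threshold, the more likely a positive reward will be returned, so the arm with threshold -0.7 (index 4) is the best one. We want our agent to learn to always choose that arm.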

In [48]:
bandits = [-0.2,0,0.2,-0.6, -0.7, 0.1]
num_bandits = len(bandits)
def pullBandit(bandit):
    #Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        #return a positive reward.
        return 1
    else:
        #return a negative reward.
        return -1
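
Because the reward depends only on whether a standard normal draw exceeds the arm's threshold, the chance of a positive reward can be worked out directly. The sketch below is an illustration, not part of the original notebook: `win_probability` is a hypothetical helper that evaluates P(Z > threshold) with the standard library's error function.

```python
import math

# Thresholds copied from the bandits list above.
bandits = [-0.2, 0, 0.2, -0.6, -0.7, 0.1]

def win_probability(threshold):
    """P(Z > threshold) for a standard normal Z (hypothetical helper)."""
    return 0.5 * (1.0 - math.erf(threshold / math.sqrt(2.0)))

for index, threshold in enumerate(bandits):
    p = win_probability(threshold)
    # Expected reward per pull is (+1) * p + (-1) * (1 - p) = 2p - 1.
    print("bandit %d: P(win) = %.3f, expected reward = %+.3f" % (index, p, 2 * p - 1))

# The arm with the lowest threshold has the highest chance of a positive reward.
best = min(range(len(bandits)), key=lambda i: bandits[i])
print("best bandit:", best)
```

The arm at index 4 (threshold -0.7) wins roughly three pulls out of four, which is why it is the one the agent should settle on.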

### Define the Neural Network

In [49]:
learning_rate = 0.001
num_episodes = 5000
e = 0.1 # Chance of taking random action

print_every = 500

tf.reset_default_graph()
# Give weights for each bandit. To begin, we assume all of them are equally likely to win
weights = tf.Variable(tf.ones([num_bandits]))
next_action = tf.argmax(weights, name="next_action")

reward = tf.placeholder(shape=[1], dtype=tf.float32)
action = tf.placeholder(shape=[1], dtype=tf.int32) # Represents index in bandits array
policy = tf.slice(weights, action, [1]) # Our policy here is nothing but our weight
loss = -(tf.log(policy) * reward)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
update = optimizer.minimize(loss)
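
To see what `optimizer.minimize(loss)` does to the weights, it may help to write out one update by hand. For `loss = -log(w[a]) * r`, the gradient with respect to the chosen weight is `-r / w[a]`, and every other weight has zero gradient. The NumPy sketch below assumes plain gradient descent with the same learning rate; `policy_gradient_step` is a hypothetical name, not a TensorFlow function.

```python
import numpy as np

def policy_gradient_step(weights, action, reward, learning_rate=0.001):
    """One gradient-descent step on loss = -log(weights[action]) * reward."""
    updated = weights.copy()
    # d(loss)/d(weights[action]) = -reward / weights[action];
    # all other entries are untouched because their gradient is zero.
    gradient = -reward / weights[action]
    updated[action] -= learning_rate * gradient
    return updated

weights = np.ones(6)
after_win = policy_gradient_step(weights, action=4, reward=1)
after_loss = policy_gradient_step(weights, action=4, reward=-1)
print(after_win[4], after_loss[4])
```

A positive reward nudges the chosen weight up and a negative reward nudges it down, so arms that win more often than they lose gradually overtake the rest in the `argmax`.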


### Train 

In [50]:
bandit_rewards = np.zeros(num_bandits)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < num_episodes:
        if np.random.rand(1) < e:
            next_a = np.random.randint(num_bandits)
        else:
            next_a = sess.run(next_action)
        
        earned_reward = pullBandit(bandits[next_a]) #Get our reward from picking one of the bandits.
        
        _,pol,ww = sess.run([update,policy,weights], feed_dict={reward:[earned_reward],action:[next_a]})
        
        bandit_rewards[next_a] += earned_reward
        
        if i % print_every == 0:
            print(str(i) + " Episodes Complete: Rewards " + str(bandit_rewards))
        
        i += 1
        
print("Winning bandit: " + str(np.argmax(ww)))

print("Our Agent is: " + ("right!" if np.argmax(ww) == np.argmax(-np.array(bandits)) else "wrong!"))

0 Episodes Complete: Rewards [-1.  0.  0.  0.  0.  0.]
500 Episodes Complete: Rewards [   5.   -4.   -3.    6.  220.    1.]
1000 Episodes Complete: Rewards [   5.   -3.  -10.    9.  448.   -2.]
1500 Episodes Complete: Rewards [   8.    3.  -15.   14.  699.   -6.]
2000 Episodes Complete: Rewards [  10.    2.  -11.   19.  944.   -7.]
2500 Episodes Complete: Rewards [   11.     4.   -14.    20.  1170.    -8.]
3000 Episodes Complete: Rewards [   12.     5.   -11.    23.  1427.   -11.]
3500 Episodes Complete: Rewards [   10.    12.   -10.    26.  1656.    -7.]
4000 Episodes Complete: Rewards [    9.    10.   -10.    30.  1902.   -10.]
4500 Episodes Complete: Rewards [    9.    11.   -11.    36.  2133.   -11.]
Winning bandit: 4
Our Agent is: right!
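
The notebook relies on `tf.placeholder` and `tf.Session`, which are TensorFlow 1.x APIs (in TensorFlow 2.x they live under `tf.compat.v1`). For readers who only want to experiment with the idea, here is a rough NumPy-only sketch of the same epsilon-greedy loop. It is an approximation under stated assumptions, not the notebook's code: `train_bandit_agent` and its `seed` parameter are hypothetical, and the weight update is the hand-derived gradient step for `loss = -log(w[a]) * r`.

```python
import numpy as np

def train_bandit_agent(bandits, num_episodes=5000, learning_rate=0.001,
                       epsilon=0.1, seed=0):
    """Epsilon-greedy bandit agent with a manual gradient update (sketch)."""
    rng = np.random.RandomState(seed)  # seeded only to make runs repeatable
    weights = np.ones(len(bandits))
    total_rewards = np.zeros(len(bandits))
    for _ in range(num_episodes):
        # Explore with probability epsilon, otherwise pick the largest weight.
        if rng.rand() < epsilon:
            action = rng.randint(len(bandits))
        else:
            action = int(np.argmax(weights))
        # Same reward rule as pullBandit: +1 if the draw beats the threshold.
        reward = 1 if rng.randn() > bandits[action] else -1
        # Gradient descent on -log(w[a]) * r moves w[a] by lr * r / w[a].
        weights[action] += learning_rate * reward / weights[action]
        total_rewards[action] += reward
    return weights, total_rewards

weights, total_rewards = train_bandit_agent([-0.2, 0, 0.2, -0.6, -0.7, 0.1])
print("Learned weights:", weights)
print("Greedy choice:", int(np.argmax(weights)))
```

As with the TensorFlow version, the result depends on the random draws: the greedy step can lock onto an arm that is good but not the best, so it is worth trying several values of `seed` before drawing conclusions.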
